36003 dotai embed block editor story block fields as markdown instead of html stripped text #36366
…ipped text (#36003) parseBlockEditor now returns StoryBlockMap.toMarkdown() directly instead of rendering to HTML and stripping it with Tika. Markdown is already plain text and preserves the structure (tables, code blocks, lists, headings) that the Tika path flattened away. The markdown is returned raw -- not routed through parseText (collapses newlines) or parseHTML (Tika re-strips) -- so newline-delimited structure survives the extraction layer. Adds ContentToStringUtilTest asserting a Story Block with a table and a fenced code block extracts with that structure intact, and registers it in MainSuite3a.
…i-embed-block-editor-story-block-fields-as-markdown-instead-of-html-stripped-text # Conflicts: # dotcms-integration/src/test/java/com/dotcms/MainSuite3a.java
Claude finished @hassandotcms's task in 1m 24s
Rollback Safety Analysis
Result: ✅ Safe To Rollback
🤖 dotBot Review (Bedrock)
Reviewed 3 file(s); 1 candidate(s) → 1 confirmed, 0 uncertain (unverified, kept for review).
us.deepseek.r1-v1:0 · Run: #28451883674 · tokens: in: 14379 · out: 3070 · total: 17449 · calls: 6 · est. ~$0.036
What
dotAI now embeds Story Block (Block Editor) fields as Markdown instead of rendering them to HTML and stripping the markup with Tika. `ContentToStringUtil.parseBlockEditor` returns `StoryBlockMap.toMarkdown()` directly, with no Tika/HTML round-trip. Markdown is already plain text and preserves structure (tables, code blocks, lists, headings) that the Tika path flattened.
Closes #36003.
Why
Tika flattened tables, code, lists, and headings to plain text, giving the embedding model a worse representation. Markdown keeps that structure.
Changes
- `ContentToStringUtil.parseBlockEditor` → returns `toMarkdown()` raw (not via `parseText`/`parseHTML`, which would re-collapse/re-strip the structure).
- `ContentToStringUtilTest`: asserts a Story Block with a table + fenced code block extracts with structure intact; registered in `MainSuite3a`.
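For reviewers, a minimal standalone sketch of why the Markdown is returned raw. It does not use the dotCMS classes: `sampleMarkdown`, `collapseWhitespace`, and `extractRaw` are hypothetical names invented for illustration, and `collapseWhitespace` only approximates the newline-collapsing behavior the description attributes to `parseText`. It is not the actual implementation.

```java
public class MarkdownExtractionSketch {

    // Hypothetical stand-in for a Story Block's Markdown output:
    // a heading, a table, and a fenced code block.
    static String sampleMarkdown() {
        String fence = "`".repeat(3); // builds the three-backtick fence marker
        return String.join("\n",
                "# Release notes",
                "",
                "| Field | Type |",
                "| --- | --- |",
                "| title | text |",
                "",
                fence + "java",
                "int x = 1;",
                fence);
    }

    // Assumed approximation of a text-normalizing step: every run of
    // whitespace, newlines included, becomes a single space.
    static String collapseWhitespace(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // The approach described in this PR: hand the Markdown back untouched.
    static String extractRaw(String markdown) {
        return markdown;
    }

    public static void main(String[] args) {
        String markdown = sampleMarkdown();
        String raw = extractRaw(markdown);
        String collapsed = collapseWhitespace(markdown);

        // The raw form keeps one table row per line; the collapsed form
        // puts the heading, table, and code on a single line.
        System.out.println("raw lines:       " + raw.split("\n").length);
        System.out.println("collapsed lines: " + collapsed.split("\n").length);
        System.out.println(collapsed);
    }
}
```

Running `main` shows the table rows and the code fence merged onto one line in the collapsed form, which is the structure loss the raw return avoids.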